An information retrieval approach to document sanitization

Abstract

In this paper we use information retrieval metrics to evaluate the effect of a document sanitization process, measuring both information loss and risk of disclosure. To sanitize the documents we have developed a semiautomatic anonymization process that follows the guidelines of Executive Order 13526 (2009) of the US Administration. It comprises two main, independent steps: (i) identifying and anonymizing specific person names and data, and (ii) concept generalization based on WordNet categories, in order to identify words categorized as classified. Finally, we manually revise the text from a contextual point of view to eliminate complete sentences, paragraphs and sections where necessary. For empirical tests, we use a subset of the Wikileaks Cables, made up of documents relating to five key news items which were revealed by the cables.

URL: http://dx.doi.org/10.1007/978-3-319-09885-2_9
Source URL: https://www.iiia.csic.es/en/node/54256
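The evaluation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the term list, the toy keyword search, the sample documents, and the choice of precision and recall as the metrics are all assumptions made for illustration. The abstract does not name specific metrics, and the WordNet generalization step and the manual revision are omitted here.

```python
import re

def sanitize(text, sensitive_terms, mask="[REDACTED]"):
    """Replace each sensitive term (whole word, case-insensitive) with a mask.
    Hypothetical stand-in for the name-anonymization step."""
    for term in sensitive_terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        text = re.sub(pattern, mask, text, flags=re.IGNORECASE)
    return text

def search(docs, query):
    """Toy retrieval: ids of documents that contain every query word."""
    words = query.lower().split()
    hits = set()
    for doc_id, text in docs.items():
        tokens = set(re.findall(r"\w+", text.lower()))
        if all(w in tokens for w in words):
            hits.add(doc_id)
    return hits

def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a relevant set."""
    found = len(retrieved & relevant)
    precision = found / len(retrieved) if retrieved else 0.0
    recall = found / len(relevant) if relevant else 0.0
    return precision, recall

# Invented sample documents, for illustration only.
original = {
    1: "Ambassador Smith met the minister about oil contracts",
    2: "Oil contracts were discussed at the summit",
    3: "Weather report for the capital",
}
sanitized = {i: sanitize(t, ["Smith"]) for i, t in original.items()}

# Information loss: a harmless query should still find the same documents.
utility = precision_recall(search(sanitized, "oil contracts"),
                           search(original, "oil contracts"))

# Risk of disclosure: a sensitive query should no longer find anything.
risk = precision_recall(search(sanitized, "smith"),
                        search(original, "smith"))
```

In this sketch the results retrieved from the original documents serve as the reference set. High recall on a harmless query suggests little information loss, while low recall on a sensitive query suggests a reduced risk of disclosure.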


Similar resources

Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback

Keyword spotting is a well-known method in document image retrieval, in which search over document images is based on a query word image. This paper proposes an approach to document image retrieval based on keyword spotting, presented as a framework that uses relevance feedback. Relevance feedback, an interactive and efficient method, is used in this paper to imp...


An Information Retrieval Approach to Document Sanitization

In this paper we use information retrieval metrics to evaluate the effect of a document sanitization process, measuring information loss and risk of disclosure. In order to sanitize the documents we have developed a semiautomatic anonymization process following the guidelines of Executive Order 13526 (2009) of the US Administration. It embodies two main steps: (i) identifying and anonymizing sp...


Document Sanitization: Measuring Search Engine Information Loss and Risk of Disclosure for the Wikileaks cables

In this paper we evaluate the effect of a document sanitization process on a set of information retrieval metrics, in order to measure information loss and risk of disclosure. As an example document set, we use a subset of the Wikileaks Cables, made up of documents relating to five key news items which were revealed by the cables. In order to sanitize the documents we have developed a semi-auto...


Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes that generate a list of relevant responses to a query. The document processor, matching function and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...


Publication date: 2017